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Summary 

We  consider  the  usual  proportional  hazards  model  in  the  case  where  the  baseline  hazard,  the 
covariate  link  and  the  covariate  coefficients  are  all  unknown.  Both  the  baseline  hazard  and 
the  covariate  link  are  monotone  functions  and  are  characterized  nonpar ametrically  using  a 
dense  class  arising  as  a  mixture  of  Beta  distribution  functions.  We  take  a  Bayesian  approach 
for  fitting  such  a  model.  Since  interest  focuses  more  upon  the  likelihood,  we  consider  vague 
prior  specifications  including  Jeffreys ’s  prior.  Computations  are  carried  out  using  sampling- 
based  methods.  Model  criticism  is  also  discussed.  Finally,  a  data  set  studying  survival  of  a 
sample  of  lung  cancer  patients  is  analyzed. 


Key  words:  Bayesian  model  analysis;  Gibbs  sampler;  Mixture-of-Betas  model;  model 
criticism;  survival  analysis. 
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1  Introduction 


Probablistic  models  and  statistical  analysis  for  technological  and  medical  survival  data  have 
garnered  much  attention  over  the  past  twenty  years.  Adopting  the  medical  setting  data  is 
usually  obtained  for  a  sample  of  individuals.  For  the  ith  subject,  data  consists  of  either  an 
observed  survival  time  t,  or  a  (right)  censorship  time  (in  which  case  U  >  u,)  and  a  set  of 
explanatory  variables  or  risk  factors  denoted  by  the  pxl  vector  xt. 

Survival  distributions  are  usually  characterized  by  their  hazard  function  which  is  the 
conditional  density  function  at  time  t  given  survival  up  to  time  t.  In  customary  modeling, 
the  hazard  for  an  individual  is  assumed  to  be  a  function  of  the  covariates  as  well  as  t. 
To  accommodate  censored  survival  times  we  also  require  the  survivor  function  which  is  the 
upper  tail  probability  function  of  the  survival  distribution.  When  survival  times  are  assumed 
to  be  continuous  measurements,  a  widely  used  class  of  hazard  functions  takes  the  so-called 
proportional  hazards  form, 

k(t)  =  h0(t)g(x)  (1) 

with  ho  and  g  nonnegative. 

The  integrated  hazard  associated  with  h(t)  is  H(t)  =  Ho(t)g(x)  where  H0(t)  =  /0‘  ho(u)du. 
The  survivor  function  is  S(f;  x)  =  exp(— H0(t)g(x))  and  the  density  for  t  takes  the  form 

=  (2) 

Implicit  in  (1)  is  the  fact  that  the  random  variable  U(t-,x)  =  Ha(t)g(x)  has  an  Exp(l) 
distribution.  We  shall  make  use  of  this  later.  The  function  hQ  is  called  the  baseline  hazard 
associated  with  a  baseline  survival  distribution,  i.e.,  when  g=  0.  If  for  example,  this  baseline 
distribution  is  a  Weibull,  the  associated  ho  has  a  parametric  form  h0(t)  =  ptp~l .  The 
integrated  baseline  hazard,  Ho{t)  as  defined  above  is  clearly  a  nondecreasing  function  of  t. 

Dating  to  Cox  (1972),  the  function  g  is  customarily  taken  to  have  the  parametric  form 
exp(/3Tx)  where  /3  is  an  unknown  pxl  vector  of  coefficients.  Drawing  upon  jargon  for 
generalized  linear  models,  g  can  be  more  generally  thought  of  as  a  covariate  link  function. 
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In  fact,  letting  rj  =  (3Tx,  because  EU(t;  x)=l,  EH0(t)  =  1  /g(rj).  Since  EH0(t )  increases 
as  E(t)  increases  and  E(t)  decreases  in  g(rj)  we  have  that  g  is  a  non  decreasing  function 
of  fj.  However  since  (3Tx  can  include  polynomial  forms  for  any  covariate,  g  need  not  be  a 
monotonic  function  of  any  covariate. 

We  shall  assume  that  Ho  is  an  unknown  function  from  R+  — >  R+  and  that  g  is  an 
unknown  function  from  R 1  — *•  R+  which  depends  upon  x  through  t}  =  (3Tx  with  (3  an 
unknown  parameter  vector.  Nonparametric  estimation  of  ho,  hence  H0,  usually  proceeds 
from  a  piecewise  constant  form  for  kQ  as  in  Breslow  (1974).  See  Cox  and  Oakes  (1984)  for  a 
more  general  discussion.  If  g  is  specified  and  /3  is  obtained  by  partial  likelihood  maximization 
(Cox,  1975)  then  ho  and  Ho  are  determined.  If  the  full  likelihood  is  used  maximization  is 
carried  out  iteratively,  first  for  ho  given  (3,  then  for  /3  given  ho,  etc.,  until  convergence. 

Conventional  (wisdom  offered  for  instance  in  Cox  and  Oakes,  1984  or  McCullagh  and 
Nelder,  1989)  asserts  that  if  ho  is  estimated  nonparametrically,  the  overall  fit  is  insensitive 
to  the  choice  of  g(rj),  supporting  the  usuage  of  the  mathematically  convenient  form  en. 
However  nonparametric  estimation  of  g  has  recently  received  attention.  For  instance,  in 
O’Sullivan  (1988)  and  in  Hastie  and  Tibshirani  (1990),  it  is  assumed  that  g(x)  =  eeW  where 
0(x)  is  an  arbitrary  additive  function  in  the  components  of  x,  i.e.,  0(x)  =  £f=1 0i(xi).  Partial 
likelihoods  are  used  so  that  g  is  estimated  independently  of  Hq.  We  note  that  the  form  e8^ 
does  not  include  our  assumed  form  g((3Tx).  Moreover,  an  estimate  of  g(rj)  is  easily  compared 
with  en,  which  we  may  think  of  as  a  “baseline”  covariate  link.  Staniswalis  (1989)  considers 
joint  estimation  of  unknown  Ho  and  g  with  only  a  single  dichotomous  covariate.  Indeed  if 
x  is  a  single  continuous  covariate,  i.e.,  rj  =  /So  +  (3\X,  then  Ho,  g,  fio  and  /3i  need  not  be 
identifiable  in  the  likelihood.  Staniswalis ’s  approach  uses  kernel  estimators  resulting  in  the 
maximization  of  a  weighted  likelihood.  Her  fitting  algorithm  is  also  iterative,  maximizing 
over  Ho  given  g,  then  over  g  given  a  new  Ho,  etc. 

We  model  Ho  and  g  using  a  dense  class  of  monotone  functions  from  R+  to  R+  and  from 
Rl  to  R+  respectively.  The  classes  arise  from  mixtures  of  Beta  distribution  functions  and 
have  not  been  previously  considered  in  this  context.  We  adopt  a  Bayesian  framework  to 


2 


fit  such,  a  model  thus  assuming  f3,  Ho  and  g  are  random.  Bayesian  fitting  yields  an  entire 
posterior  distribution  for  a  model  unknown  rather  than  merely  a  point  estimate  and  perhaps 
a  precision  estimate.  Inference  is  exact  rather  than  asymptotic.  If  useful  prior  information 
is  available  say  about  coefficient  parameters  we  would  be  happy  to  use  it.  However,  inter¬ 
est  usually  focuses  more  upon  the  likelihood  whence  we  would  tend  to  use  noninfonnative 
priors  (see  section  3  and  the  example  of  section  5.).  With  large  data  sets  such  automatic 
Bayesian  inference  will  be  close  to  that  arising  under  classical  maximum  likelihood  analysis. 
For  smaller  data  sets  the  Bayesian  approach  would  be  expected  to  provide  more  believable 
estimates  of  variability  than  under  likelihood  analysis.  Recent  advances  in  Bayesian  com¬ 
putation  through  the  use  of  sampling  based  methods  (Gelfand  and  Smith,  1990;  Smith  & 
Gelfand,  1993)  enable  reasonably  straightforward  fitting  of  such  models.  Bayesian  fitting  of 
fully  parametric  proportional  hazards  models  using  sampling  based  methods  is  discussed  in 
Dellaportas  and  Smith  (1992). 

Our  approach  models  the  strictly  monotone  functions  H0  and  g,  each  transformed  by  a 
suitable  monotone  transformation  to  have  range  in  [0,1],  as  unknown  cumulative  distribution 
functions.  In  the  process  the  resulting  domain  also  becomes  [0,1].  We  thus  characterize  each 
function  as  a  mixture  of  Beta  distribution  functions.  Here  we  appeal  to  a  well  known  result 
as  in,  e.g.,  Diaconis  and  Ylvisaker  (1985),  which  says  that  any  continuous  density  on  [0,1]  can 
be  arbitrarily  well  approximated  by  a  discrete  mixture  of  Beta  densities.  Unlike  distributions 
arising  under  a  Dirichlet  process,  which  could  also  be  used  here,  we  have  a  continuous,  dense 
class  of  distributions  admitting  an  explicit  form.  In  practice  we  have  treated  the  number 
of  mixands  r,  as  fixed  comparing  various  choices.  Alternatively,  a  discrete  prior  could  be 
attempted.  As  such  we  have  taken  an  infinite  dimensioned  problem  and  converted  it  to 
a  finite  dimensioned  one.  Though  arbitrarily  high  dimensional  models  can  be  employed, 
happily,  in  practice  they  will  not  usually  be  needed.  In  experimenting  with  a  range  of  r’s,  for 
a  number  of  examples,  we  have  discovered,  perhaps  not  surprisingly,  that  robustness  occurs 
with  quite  small  r.  In  introducing  randomness  to  this  finite  mixture  form  we  could  either 
assume  the  mixture  weights  to  be  random  or  the  parameters  of  the  Beta  densities  to  be 
random.  Mathematically  it  is  much  simpler  to  work  with  the  former.  Hence  for  a  given  r  we 
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choose  a  fixed  set  of  Beta  densities  and  then  assume,  a  priori,  that  the  mi-ring  weights  arise 
as  a  random  draw  on  the  r- dimensional  simplex.  Earlier  work  with  this  approach  (Mallick 
and  Gelfand,  1993)  looked  at  the  problem  of  fitting  a  generalized  linear  model  when  the  link 
function  is  assumed  unknown. 

The  format  of  the  paper  is  thus  as  follows.  In  section  2  we  formalize  the  likelihood, 
in  particular  the  modeling  of  the  unknown  S0  and  g.  In  section  3  we  discuss  the  prior 
specification  needed  to  complete  the  Bayesian  model.  Section  4  describes  briefly  the  sampling 
based  fitting  of  this  model.  Model  criticism  and  comparison  are  the  subject  of  section  5. 
Finally,  in  section  6  we  analyze  a  set  of  survival  data,  with  censoring,  on  40  advanced  lung 
cancer  patients  which  is  discussed  in  Lawless  (1982). 

2  Likelihood  specification 

Suppose  a  total  of  n  subjects.  For  the  ith  individual  let  7 ,•  =  1  if  U  >  v =0  otherwise 
and  let  yi  =  Vi).  Then  the  joint  density  of  the  sample,  following  from  (2),  is 

nMy.M0Txi)]  exp(-H0(yi)g(PTXi))  (3) 

v=i 

With  H0ig  and  /3  unknown  we  may  view  (3)  as  a  likelihood  function  L({3,  Ho,  g).  Indeed  L 
has  an  infinite  dimensional  argument;  without  further  assumptions  it  need  not  be  identifiable. 
For  example,  if  /3rx  =  ft  +  fts  then  H0,  g,  ft,  and  ft  cannot  be  identified. 

As  noted  in  section  1,  our  inference  approach  is  Bayesian  requiring  the  specification  of 
a  prior  f((3,Ho,g)  whence  L((3,Ho,g)-f(/3,Ho,g)  is  the  Bayesian  model.  In  section  3  we 
discuss  forms  for  the  prior  /.  The  identifiability  question  from  a  Bayesian  point  of  view 
becomes  whether  or  not  the  data  can  inform  about  all  of  the  unknown  parameters  in  the 
model.  If  yes,  provided  a  proper  posterior  results,  there  is  no  identifiability  problem.  If 
no  and  if  /  is  improper  then  the  posterior  necessarily  is  as  well  and  we  have  an  ill-defined 
Bayesian  model.  If  no  and  if  /is  proper  then  the  prior  drives  the  posterior. 

How  shall  we  model  H0  and  g ?  From  the  previous  section  we  assume  Ho  is  a  strictly 
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increasing  function  from  R f  onto  J?+  and  that  g  is  a  strictly  increasing  function  from  R1  onto 
R +.  Let  JQ  =  ooHo/iaoHo  +  bo)  where  ao,bo>  0  are  specified  constants  (choice  of  oo,  &o  will 
be  taken  up  below).  Then  Jo  is  a  differentiable  c.d.f.  Modeling  Jo  is  equivalent  to  modeling 
an  unknown  distribution  function.  Similarly,  let  Jg  =  a1/(ai5  +  &i)  where  a*,  >  0  (choice 

of  Oi,  i>i  will  be  clarified  below).  Then  again  Jg  is  a  differentiable  c.d.f. 

A  rich  class  of  models  for  Jo  and  for  Jg  may  be  created  as  follows.  Associate  with  Ho  a 
specific  baseline  hazard  function  Ho-  For  instance  Ho(t)  might  be  tp  for  a  given  p.  (Random  p 
could  be  handled  using  a  hyperprior  but  we  have  not  investigated  this).  Similarly,  associate 
with  g  a  specific  baseline  covariate  link  g.  For  instance  g(g)  might  be  ev.  Diaconis  and 
Ylvisaker  (1985)  argue  that  discrete  mixtures  of  Beta  densities  provide  a  continuous  dense 
class  of  models  for  densities  on  [0,1].  A  member  of  this  class  has  a  density  of  the  form 

/(“)  =  XI  wiBe{u\ci,  di)  (4) 

i= l 

where  r  denotes  the  number  of  mixands,  wi  >  0,  wi  =  1  and  Be(u\q,  dt)  denotes  the  Beta 
density  in  standard  form  with  parameters,  q  and  d\.  If  IB{u\q,di )  denotes  the  incomplete 
Beta  function  associated  with  Be(u\q,di)  then  let 

ri  rj  _ 

J0(<;  Wi)  =  ^  wuIB(J0(t)]  cu,  du)  and  Jg{rj]  w2)  =  £  w2lIB{jg(-q);  c2/,  d2i)  (5) 

J=i 

where  J0(t )  =  aoHo(t)/(aoH0(t)  +  bo)  and  Jg(ri)  =  fli|(7)/(oi^(»7)  +  h).  Clearly  J0  and  Jg 
are  c.d.f.’s. 

It  will  be  of  interest  in  subsequent  sections  to  calculate  ho(t)  =  H'a(t)  and  g{rj).  Since 
Hq  =  boJo/ao(l  -  Jq),  Hq  =  ( bodJo/dt)/ao(l  -  Jo)2-  Similarly  g  =  &i(l  -  J9)/axJ,  so  g  = 
{-bi  dJg/&n)/aiJ%.  From  (5),  d J0(t;  w  J/dt  =  J'q  WiiBe(j0(t)-,  cu,  du),  dJg(v;  ^2)/dr]  = 
j'gTZ:iw2iBe(jg(Ti)-,C2i,d2i)  with  j'Q  =  aoboH'0l{aoHo  t  b0)2  and  Jg  =  -aibig'0/(aig  +  ij)2. 

We  note  that  models  incorporating  mixtures  other  than  Beta  could  be  used,  e.g.,  Gammas 
on  R+,  uniforms  on  R1.  We  work  with  Beta  densities  since  the  restriction  to  the  bounded 
interval  [0,1]  enables,  for  a  given  r,  convenient  choice  of  the  set  of  q,  di  so  that  mixing  with 
a  weight  vector  w=(  wl}  -  ■  •  wr)  belonging  to  the  r-dimensional  simplex  yields  a  rich  family 
of  densities. 
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In  our  experience,  inference  using  mixtures  with  small  r  is  virtually  indistinguishable 
from  those  with  much  larger  r.  In  fact,  allowing  r>n  does  not  insure  perfect  fit  since  H0 
and  g  are  restricted  to  be  monotone.  Hence  we  specify  r  making  (3)  a  finite  dimensional 
likelihood.  Given  r,  it  is  natural  and  certainly  mathematically  easier  to  assume  that  the 
component  Beta  densities  are  specified  but  that  the  weights  are  unknown.  In  particular  we 
choose  the  set  of  q,  di  to  provide  a  collection  of  Beta  densities  which  blanket  [0,1].  Within 
this  objective,  our  experience  shows  that  inference  is  again  robust  to  the  particular  choice. 
In  our  example  we  set  rx  =  r2  =  r,  cu  =  c2j  =  q,  du  =  d2i  =  d\  with  q=AZ,  d\  =  A(r-fl-/). 
(See  section  3  for  further  discussion  of  this  point.)  In  any  event,  specification  of  Ho  and  g  is 
equivalent  to  specification  of  wx  and  w2  and  we  can  denote  the  likelihood  by  L((3,  wx,w2). 

Lastly  we  return  to  the  choice  of  ao,bo  and  ai,b\.  This  is  not  a  modeling  issue  but  a 

computational  one.  For  example,  if  ]j?|  is  very  large,  g(g)  =  eP  will  produce  over  or  underflow. 

Since,  in  (5),  only  Jg{t} )  is  needed,  scaling  and  centering  using  appropriate  ax  and  b\  can 

alleviate  this  problem.  In  practice  ax  and  could  be  obtained  by  looking  at  the  range  of 
•  T  -  . 

/3  x  over  the  sample  with  /3  say  the  partial  likelihood  maximizer.  For  J0(t),  often  oo  =  &o=l 
will  work.  Again,  there  is  no  notion  of  “best”  values.  Many  choices  of  a*  and  ft,  will  work 
equally  well  and,  to  keep  notation  simpler,  we  suppress  the  o i  and  bt  in  the  sequel. 


3  Prior  specification 

Given  the  likelihood  L(/3,  w1:w2)  we  next  address  the  specification  of  the  prior  f((3,  wx,  w2). 
Since  primary  interest  is  in  the  likelihood  and  since  only  occasionally  will  there  be  useful 
prior  information  we  think  in  terms  of  vague  priors.  In  this  regard  one  advantage  does  ac¬ 
crue  to  the  Bayesian  compared  with  the  likelihoodist.  Maximum  likelihood  estimators  under 
L(/3,  wx,  w2)  need  not  be  finite,  i.e.,  need  not  occur  within  the  interior  of  the  parameter  space 
(see  Wedderburn,  1976).  However  in  the  Bayesian  context  the  prior  distribution  typically 
overcomes  this  problem  yielding  a  well  behaved  posterior  density. 

We  assume  that  / 03,wx,w2)=  /Q3|w!,  w2)./(w1)./(w2)  where  /(wi)  is  a  distribution 
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on  the  r,-  dimensional  simplex.  Since  Wj  determines  H0(t)  we  might  attempt  to  choose  /(w:) 

such  that  in  some  senses  H0(t)  is  centered  around  i.e.,  J0(t;  Wj)  is  centered  around 

J0(t).  Prom  (4)  this  corresponds  to  centering  the  random  density  /(u)  around  a  uniform 

density  on  [0,1].  Centering  using  the  mean  results  in  the  equation 

n 

53  E(wuIB(u;  ci/,  du)  =  u  (6) 

i=i 

If  Wi  ~  Dtr(7l)  then  (6)  requires  that  the  average  of  the  IB( u\  cu,  du)  equal  u.  If  rx  is  even 
and  cu  and  du  are  chosen  as  discussed  in  section  2  then  straightforward  calculation  (which 
we  omit)  shows  that,  to  a  first  order  approximation,  (6)  holds.  Similar  remarks  apply  to 
/(w2)  with  regard  to  centering  g(ij )  around  g(r]). 

Returning  to  /(/ 3|wx,w2)  a  multivariate  normal  prior  is  often  proposed.  In  the  limit, 
as  the  precision  matrix  tends  to  0,  a  flat  prior  results.  It  is  interesting  to  investigate  the 
form  of  JeflEreys’s  prior  here.  Recall  that  this  prior  is  the  square  root  of  the  determinant 
of  the  Fisher  information  matrix  associated  with  L(/3,  wx,  w2).  Given  w:  and  w2,  standard 
calculation  shows  that  Jeffreys’s  prior  is  proportional  to  |XTMX|1/2  where  X  is  the  n  x  p 
matrix  whose  rows  are  the  xf ’s  and  M  is  an  n  x  n  diagonal  matrix  such  that 

Ma  =  -(1  -  S(v»;  xi))(g‘(/3Txi)/g{{3Txt))2  (7) 

Working  with  (3),  this  calculation  is  clarified  by  noting  that  J5(l  —7/  —  Ho(yi)g((3TXi))  =  0. 
Prom  (7)  we  see  that  JeflEreys’s  prior  depends  upon  both  wx,  w2.  Recall  that  g  was  calculated 
in  section  2;  g"  follows  similarly.  In  the  case  of  of  no  censoring  ,  Ma  =  —(g'((3TXi)/g((3Txl))2 
which  depends  only  upon  w2.  Finally,  the  question  of  whether  a  proper  posterior  results 
under  the  above  specifications  cn  be  examined  using  ideas  in  Ibrahim  and  Laud  (1991).  No 
details  are  given  here. 


4  Implementing  the  Bayesian  model. 

Given  the  Bayesian  model  L(/3,  w1}  w2)  -f(/3\wi,  w2)  •/( wx)  •  /( w2),  all  inference  proceeds 
from  the  posterior  distribution  of  (/3,  Wx,  w2)  which  is  proportional  to  this  product.  Ana¬ 
lytic  investigation  of  this  p  -)-  (ri  —  1)  +  (r2  —  1)  dimensional  nonnormalised  joint  distribution 
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is  infeasible.  It  would  be  difficult  enough  to  standardize  this  form  much  less  to  calculate 
expectations,  margined  distributions,  etc.  Thus  we  adopt  a  sampling  based  approach  using  a 
Markov  chain  Monte  Carlo  algorithm  to  obtain  random  draws  essentially  from  this  posterior 
distribution.  In  particular,  a  version  of  the  Gibbs  sampler  (Gelfand  and  Smith,  1990)  is 
natural  since  the  complete  conditional  distributions  for  /3,  wx  and  w2  are  also  proportional 
to  the  nonnonnalized  posterior.  Because,  for  a  given  f3,  calculation  of  L(f. 3,  w2,  Wj)  requires 
n(i”i  +  r2)  incomplete  Beta  function  evaluations,  making  draws  directly  from  these  distri¬ 
butions  is  inconvenient.  Instead  we  utilise  a  Metropolis-within-Gibbs  algorithm  (Muller, 
1993)  with  a  multivariate  normal  proposal  density  for  /3,  and  Dirichlet  proposal  densities 
for  Wi  and  w2.  Under  a  logit  transformation  of  w;  to  r,  —  1  dimensional,  space  these  last 
proposals  could  be  multivariate  normal.  We  run  short  Metropolis  sub-chains,  typically  20  to 
50  iterations,  for  each  update  within  each  iteration  of  the  Gibbs  sampler.  Such  sub-chains 
are  attractive  in  that,  to  proceed  from  the  current  step  to  the  next  requires  only  one  new 
evaluation  of  the  likelihood.  Starting  points  for  replications  of  the  Gibbs  sampler  are  taken 
in  the  vicinity  of  the  maximum  likelihood  estimates  for  /3  under  H0  and  g0  with  random 
uniform  draws  for  the  Wj.  In  the  example  of  section  5  we  used  1000  parallel  chains  initially, 
adaptively  improving  the  proposal  densities,  following  Muller’s  suggestions,  for  typically  25 
iterations  after  which  we  ran  ten  parallel  Gibbs  chains  each  for  approximately  5000  iterations 
using  various  “convergence”  diagnostics  before. stopping. 

Let  us  denote  the  retained  output  of  the  Gibbs  sampler,  which  is  approximately  a  sample 
from  the  posterior,  by  (/3y",wJ-,w5-)  j  =  1,2, ...  ,m  where  typically  m  is  1000  to  2000.  This 
sample  enables  us  to  carry  out  any  desired  posterior  inference.  Such  inference  would  likely 
examine  the  marginal  posterior  distributions  of  the  coefficients.  The  posteriors  of  Ho  and  ho 
can  be  obtained  at  any  t  using  the  posterior  of  w2.  The  posterior  of  g  can  be  obtained  at 
any  g  using  the  posterior  of  w2. 
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5  Model  criticism  and  comparison 


Here  we  consider  informal  techniques  for  checking  model  assumptions  and  for  comparing 
models.  In  the  non  Bayesian  framework,  with  censored  data,  assessment  of  the  hazard 
function  is  usually  done  using  a  hazard  plot.  More  specifically,  a  parametric  specification 
for  Ho,  equivalently  ho,  can  be  checked  using  an  empirical  estimate  of  H0  or  h0.  Plots 
of  Ho,  logjEfo,  ho  or  logAo  vs  t  or  logf  can  be  compared  with  theoretical  plots  under  the 
parametric  specification.  An  alternative  is  a  residual  analysis  based  upon  the  earlier  remark 
that  given  /3,  Ho  and  g,  Ui  =  U(t,\  x,)  =  H0(ti)g((3TXi)  ~  Exp(l).  Taking  H0  and  g  as 
known  and  inser  ting  a  sample  estimate  for  /3,  the  resulting  Ui  are  compared- with  the  Exp(l) 
distribution  using  a  theoretical  or  empirical  QQ  plot  (see,  e.g.,  Lawless,  1982;  or  Cox  and 
Oakes,  1984).  Unfortunately  the  required  estimation  sacrifices  both  the  independence  and 
exact  distribution  for  the 

How  should  such  checking  be  handled  within  the  Bayesian  framework?  Generally,  model 
assessment  proceeds  from  predictive  distributions  (Box,  1980).  At  the  individual  level  this 
proposes,  for  the  observed  survival  times,  comparison  of  f(ti\data)  with  for  the  censored 
survival  times  evaluation  of  S(ti\data)  at  V{.  It  may  be  preferable  not  to  include  the  observed 
information  on  the  ith  individual  in  obtaining  the  predictive  distribution  for  U.  That  is,  we 
would  condition  on  data(i),  all  of  the  data  excluding  the  information  on  the  ith  individual. 
(Such  deletion  is  usually  referred  to  as  crossvalidation.)  Regardless,  Gelfand  and  Dey  (1993) 
describe  how  to  use  the  output  of  the  Gibbs  sample  to  obtain  Monte  Carlo  estimates  of 
either  of  these  predictive  densities  and  predictive  survivor  functions. 

For  instance  f(U\data)  —  m-1  where  /*(t,;  Xi)  denotes  the  density  (2) 

given  (/3*,  w*7-,  w^).  Similarly  S(U\daia)  =  m~l  YUJLi  5*(t,;  Xi)  where  5*(ti;  xt)  denotes  the 
survivor  function  associated  with  (2)  given  (/^.w^,  w^).  Appropriate  resampling  of  the 
(/3J,  wjj-,  w53)  enables  conditioning  on  data(i )  (Smith  and  Gelfand,  1992).  In  particular  for 
an  observed  t,  we  obtain  the  conditional  predictive  ordinate  (CPO)  as  f{ti<0bt\data(i)).  For 
an  unobserved  U  we  can  obtain  5(u,|data(z)).  A  large  CPO  indicates  agreement  between  the 
observation  and  the  model  (see  Pettit  and  Young,  1990).  Hence  models  can  be  compared 


using  a  plot  vs.  t  of,  e.g.,  CPO’s  under  the  various  models,  CPO  ratios,  or  log  CPO  ratios. 
Similarly,  for  the  censored  observations,  a  large  value  of  S(vi\data(i))  indicates  agreement 
between  the  observed  censoring  and  the  model. 


6  An  illustrative  example 

The  data  in  table  1  describes  survival  data  in  days  for  40  advanced  lung  cancer  patients  and 
is  taken  from  Lawless  (1982).  The  three  explanatory  variables  are  :  jc1}  performance  status 
at  diagnosis  (a  measure  on  a  scale  from  0  to  100  of  the  patient’s  general  medical  condition), 
x2,  age  of  the  patient  in  years,  and  X3,  months  from  diagnosis  to  entry  into  the  study.  Note 
that  three  of  the  40  survival  times,  indicated  by  *  are  censored.  An  exponential  model,  i.e., 
hc(t)=l,  g(rj )  =  e~v,  is  used  by  Lawless.  For  this  four  parameter  model  (intercept  and  three 
coefficients  )  maximum  likelihood  estimates  and  associated  standard  errors  can  be  obtained 
from  standard  statistical  packages  and  are  presented  in  table  2.  Accordingly,  only  xx  is  very 
important.  For  the  37  uncensored  survival  times,  figure  1  provides  a  theoretical  Q-Q  plot  for 

-  -  T 

the  Ui  =  ti  , j..ezp(— /3  Xi).  The  exponential  model  appears  adequate  but  would  we  prefer  a 
semiparametric  choice? 

We  investigated  four  models  from  a  Bayesian  perspective.  Model  1  has  a  likelihood 
arising  under  an  exponential  hazard  and  an  exponential  covariate  link  and  is  denoted  by 
BE.  Model  2  has  a  nonparametric  hazard  specification  with  an  exponential  covariate  link 
and  is  denoted  by  NE.  The  integrated  hazard  is  developed  as  in  section  2  “centered”  about 
the  exponential  hazard.  Model  3  has  an  exponential  hazard  but  a  nonparametric  covariate 
link  “centered”  about  the  expoponential  function  and  is  denoted  by  EN.  Finally  model 
4  incorporates  nonparametric  hazard  and  covariate  link  each  exponentially  centered  and 
denoted  by  NN.  When  nonparametric  forms  are  used  we  took  A  =  1  and  =  3.  We 
investigated  larger  t\’s  but  witnessed  inconsequential  improvement  in  fit  primarily  because, 
regardless  of  j%,  the  function  is  constrained  to  be  monotonic.  This  is  an  obvious  advantage 
to  our  modeling  approach.  We  can  fit  H0  and  g  using  a  large  number  of  mixands  but,  in 
practise,  a  much  more  parsimonious  choice  can  usually  be  taken.  We  used  a  fiat  prior  on 
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/3  and,  if  w,  appears  we  chose  a  uniform  prior  on  the  simplex  in  three  dimensions.  Hence 
model  1  has  4  parameters,  models  2  and  3  have  6  and  model  4  has  8. 

With  regard  to  model  choice,  in  figure  2  we  present  a  CPO  plot  for  each  of  the  four 
models.  The  three  semiparametric  choices  are  better  than  EE  with  the  largest  model,  NN, 
the  best.  For  each  of  the  three  censored  survival  times  under  each  model,  Table  3  gives 
P(U  >  Ui|data(i))  along  with  the  product  over  these  three  times.  Models  NE  and  NN  are 
best  at  explaining  the  censoring. 

Looking  at  model  choice  in  a  different  way  suppose  we  compare  the  estimated  nonpara- 

metric  functions  with  their  centering  exponentials.  As  the  Bayes  estimate. of  the  nonpara- 

metric  function  we  take  the  posterior  mean.  In  figure  3a  we  compare  the  integrated  hazards 

under  NE  and  under  NN  with  Ho(t)  =  t.  In  figure  3b  we  compare  the  hazards  under  NE 

and  under  NN  with  ho(t)  —  1.  The  domain  for  t  agrees  in  both  figures  and  includes  virtually 

all  the  observed  t{.  We  note  that  Ho  and  ho  are  almost  indistinguishable  under  NE  and 

NN  but  differ  dramatically  from  those  for  the  exponential  model.  Figure  3  shows  that  for 

this  data  the  aforementioned  claim  of  Cox  and  Oakes  (1984)  and  of  McCullagh  and  Nelder 

(1989)  holds:  if  H0  is  estimated,  the  fit  is  insensitive  to  the  covaxiate  link.  In  figure  4  we 

compare  the  covariate  links  under  EN  and  NN  with  g(rj)  =  ev.  It  is  convenient  to  use  the 

*  T 

ratio  $/ e”  plotted  over  a  range  which  includes  all  of  the  /3  x<.  The  g’s  under  EN  and  under 
NN  are  in  good  agreement  for  rj  >  0  with  growing  disagreement  as  rj  decreases. 

All  of  the  above  discussion  supports  NE  or  NN.  Looking  at  NN  in  more  detail  we  find  the 
posterior  mean  and  standard  deviations  for  the  /3,  under  NN  in  table  2.  In  terms  of  inference 
for  (3  alone  there  is  little  qualitatative  difference  between  EE  and  NN.  Finally,  in  table  4 
we  provide  the  posterior  correlations  amongst  the  8  parameters  under  NN.  As  expected, 
there  is  high  negative  correlation  between  Wn  and  wu  and  between  w2i  and  w22.  Otherwise, 
correlations  are  weak  and  Wj  (i.e.  Ho)  is  essentially  uncorrelated  with  w2  and  /3  (i.e.,  g). 

In  conclusion,  though  the  present  data  set  does  not,  perhaps,  warrant  a  more  elaborate 
model  than  EE,  it  is  attractive  to  have  a  broader  model  formulation  and  the  associated 
fitting  and  choice  tools  to  demonstrate  this. 
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Table  1:  Long  Cancer  Survival  Data  (Lawless,  1982) 
*  indicates  censored  observation 


t 

*1 

*2 

*3 

t 

*1 

*2 

*3 

411 

70 

64 

5 

100 

60 

37 

13 

126 

60 

63 

9 

999 

90 

54 

12 

118 

70 

65 

11 

231* 

50 

52 

8 

92 

40 

69 

10 

991 

70 

50 

7 

8 

40 

63 

58 

1 

20 

65 

21 

25* 

70 

48 

9 

201 

80 

52 

28 

11 

70 

48 

11 

44 

60 

70 

13 

54 

80 

63 

4 

15 

50 

40 

13 

153 

60 

63 

14 

103* 

70 

36 

22 

16 

30 

53 

4 

2 

40 

44 

36 

56 

80 

43 

12 

20 

30 

54 

9 

21 

40 

55 

2 

51 

30 

59 

87 

287 

60 

66 

25 

18 

40 

69 

5 

10 

40 

67 

23 

90 

60 

50 

22 

8 

20 

61 

19 

84 

80 

62 

4 

12 

50 

63 

4 

164 

70 

68 

is 

177 

50 

66 

16 

19 

30 

39 

4 

12 

40 

68 

12 

43 

60 

49 

11 

200 

80 

41 

12 

340 

80 

64 

10 

250 

70 

53 

8 

231 

70 

67 
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Table  2:  Summary  of  Likelihood  Analysis  for  EE 
Posterior  Analyis  for  NN 


EE 

mie(s.e.) 

NN 

post  mean 
(s.d.) 

*0 

4.742(.1612) 

3.932(.1668) 

A 

0.060(.Q09) 

0.035(.012) 

^2 

0.013(.Q15) 

0.001(.018) 

h 

.003(.010) 

0.003(.Q09) 

Table  3:  Comparison  Amongst  Models  of  Survival 
Probabilities  for  Censored  Observations 


Model 

(1) 

P(t1>25|data(l)) 

(2) 

P(t23>23|data(23)) 

(3) 

P(t29>103|data(29)) 

(1)(2)(3) 

EE 

0.8852 

0.0253 

0.5435 

0.0121 

NE 

0.9091 

0.1893 

0.7019 

0.1208 

EN 

0.8899 

0.0904 

0.6252 

0.0503 

NN 

0.9133 

0.1894 

0.7116 

0.1228 

Table  4:  Posterior  Correlations  under  Model  NN 


h 

h 

WU 

W12 

W21 

.194 

% 

.026 

.039 

h 

.101 

.267 

-.042 

-n 

.007 

-.017 

.011 

.018 

w12 

.004 

.011 

.015 

.019 

-.051 

W21 

.332 

.245 

-.102 

.130 

.019 

.018 

W22 

—.141 

.310 

-.160 

-.236 

-.706 

-.015 

-.813 

Figure  1:  Theoretical  Q-Q  Plot  for  the  U.  Under  Exponential  Model 
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and  NN 


Figure  4:  Comparison  of  Covarl 
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